Refactor/tornadovm planning by orionpapadakis · Pull Request #117 · beehive-lab/GPULlama3.java

orionpapadakis · 2026-05-28T12:59:06Z

This PR reorganizes TornadoVM execution planning around three variant axes:

model family
quantization
forward execution mode

The previous structure was mainly shaped around two axes: model family and quantization. With prefill-decode and batch-prefill-decode, execution mode becomes a third axis, which greatly increases the number of
combinations each model/quantization pair may need to support.

This refactor introduces forward plans, task-graph layouts, and model/quantization component providers so single-token, prefill-decode, and batch-prefill-decode paths can share one cleaner planning structure
instead of growing separate master-plan dispatch logic.
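
To make the combinatorics concrete, here is a minimal sketch that enumerates the variant space. It assumes three values per axis where the PR only names the axes themselves; the enum and record names are illustrative and are not the types used in this repository.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: every type name here is hypothetical.
final class PlanAxesSketch {
    enum ModelFamily { LLAMA, MISTRAL, QWEN_3 }
    enum Quantization { F16, Q8_0 }
    enum ForwardMode { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

    // One point in the three-axis variant space.
    record PlanVariant(ModelFamily family, Quantization quantization, ForwardMode mode) { }

    // Enumerates every (family, quantization, mode) combination.
    static List<PlanVariant> allVariants() {
        List<PlanVariant> variants = new ArrayList<>();
        for (ModelFamily family : ModelFamily.values()) {
            for (Quantization quantization : Quantization.values()) {
                for (ForwardMode mode : ForwardMode.values()) {
                    variants.add(new PlanVariant(family, quantization, mode));
                }
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // 3 families x 2 quantizations x 3 modes
        System.out.println(allVariants().size());
    }
}
```

With these assumed values, two axes give 6 pairs and the third axis raises that to 18 variants, which is the growth the refactor is meant to contain.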

Notes

Adds Llama Q8_0 prefill-decode support, which also demonstrates why this refactor is needed.
Renames task-graph abstractions for clearer roles.
Moves scheduling helpers into a dedicated TornadoVM scheduling package.
Keeps graph topology and execution behavior unchanged outside the new prefill-decode path.

Verification

Use Java 21 or 25.
Set up TornadoVM.
mvn clean install
llama fp16 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama fp16 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama fp16 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama fp16 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-F16.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs
llama q8_0 (single-token):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048
llama q8_0 (prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode
llama q8_0 (batch-prefill-decode):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32
llama q8_0 (batch-prefill-decode-CUDA_GRAPHS):
./llama-tornado --gpu --ptx --model ~/LLMModels/Llama-3.2-1B-Instruct-Q8_0.gguf --prompt "$LONG_PROMPT" --max-tokens 2048 --with-prefill-decode --batch-prefill-size 32 --cuda-graphs

Any other model (Mistral, Qwen3, etc.) should still pass with the single-token configuration but fail for any prefill-decode configuration with a message like the following:

WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.UnsupportedOperationException: BATCH_PREFILL_DECODE not yet supported for QWEN_3 + F16
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createQwen3FP16Plan(ForwardPlanFactory.java:174)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createFP16Plan(ForwardPlanFactory.java:90)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.create(ForwardPlanFactory.java:74)
  at org.beehive.gpullama3.tornadovm.plan.ForwardPlanFactory.createBatchPrefillDecode(ForwardPlanFactory.java:65)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.createExecutionPlan(TornadoVMMasterPlanBatchPrefillDecode.java:70)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlanBatchPrefillDecode.<init>(TornadoVMMasterPlanBatchPrefillDecode.java:51)
  at org.beehive.gpullama3.tornadovm.TornadoVMMasterPlan.initializeTornadoVMPlan(TornadoVMMasterPlan.java:59)
  at org.beehive.gpullama3.model.Model.runInstructOnce(Model.java:205)
  at org.beehive.gpullama3.LlamaApp.runSingleInstruction(LlamaApp.java:18)
  at org.beehive.gpullama3.LlamaApp.main(LlamaApp.java:44)
Error: Command failed with return code 1
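
The failure above can be pictured with a minimal dispatch sketch. The rule that only Llama supports the prefill paths is taken from the description; the class, enum, and record names are hypothetical and do not reproduce `ForwardPlanFactory`.

```java
// Illustrative sketch only: not the repository's ForwardPlanFactory.
final class ForwardPlanDispatchSketch {
    enum ModelFamily { LLAMA, MISTRAL, QWEN_3 }
    enum Quantization { F16, Q8_0 }
    enum ForwardMode { SINGLE_TOKEN, PREFILL_DECODE, BATCH_PREFILL_DECODE }

    // Placeholder for whatever a real forward plan would hold.
    record ForwardPlan(ModelFamily family, Quantization quantization, ForwardMode mode) { }

    static ForwardPlan create(ModelFamily family, Quantization quantization, ForwardMode mode) {
        // Assumption: single-token works for every pair, and only Llama
        // has the prefill-decode and batch-prefill-decode paths.
        boolean supported = mode == ForwardMode.SINGLE_TOKEN || family == ModelFamily.LLAMA;
        if (!supported) {
            throw new UnsupportedOperationException(
                    mode + " not yet supported for " + family + " + " + quantization);
        }
        return new ForwardPlan(family, quantization, mode);
    }
}
```

Failing fast at plan creation keeps unsupported combinations from reaching task-graph construction.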

mikepapadim · 2026-05-30T10:42:52Z

+                MemorySegment tokenEmbeddings = weights.getTokenEmbeddingTable().asByteArray().getSegment();
+                int blocksPerToken = (configuration.dim() + 31) / 32;
+                long bytesPerToken = (long) blocksPerToken * 34;
+                MemorySegment.copy(tokenEmbeddings, (long) token * bytesPerToken,
+                        state.embeddingX.getSegment(), 0, bytesPerToken);
+            }


Maybe this should be a method of its own. Same for the block above.
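
One way to read this suggestion is to pull the Q8_0 offset arithmetic into small helpers. The sketch below assumes the standard Q8_0 block layout (32 values stored in 34 bytes: a 2-byte fp16 scale plus 32 int8 values); the class and method names are hypothetical.

```java
// Illustrative sketch only: helper names are hypothetical.
final class Q8_0EmbeddingLayoutSketch {
    static final int BLOCK_SIZE = 32;   // values per Q8_0 block
    static final int BLOCK_BYTES = 34;  // 2-byte fp16 scale + 32 int8 values

    // Bytes occupied by one token's embedding row, rounding dim up to whole blocks.
    static long bytesPerToken(int dim) {
        int blocksPerToken = (dim + BLOCK_SIZE - 1) / BLOCK_SIZE;
        return (long) blocksPerToken * BLOCK_BYTES;
    }

    // Byte offset of a token's row inside the embedding table.
    static long tokenByteOffset(int token, int dim) {
        return (long) token * bytesPerToken(dim);
    }
}
```

The call site from the diff would then reduce to a single `MemorySegment.copy(tokenEmbeddings, tokenByteOffset(token, dim), state.embeddingX.getSegment(), 0, bytesPerToken(dim))`, and the same helpers could serve the earlier block the comment refers to.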

mikepapadim · 2026-05-30T10:43:51Z

    }
+
+    // ── Q8_0 Batch Kernels ───────────────────────────────────────────────────
+


The formatting is odd. Wrap the block in `@formatter:off` / `@formatter:on` markers and run the autoformatter.
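
For reference, a minimal sketch of the comment markers, assuming an IDE formatter with marker support enabled (IntelliJ IDEA and Eclipse both offer this as a setting). The class and its methods are hypothetical and only illustrate placement.

```java
// Illustrative sketch only: class and methods are hypothetical.
final class FormatterMarkerSketch {

    // @formatter:off
    // ── Q8_0 Batch Kernels ───────────────────────────────────────────────────

    @Override public String toString() { return "FormatterMarkerSketch"; }
    // @formatter:on

    // Code outside the markers is reformatted as usual.
    static int blocksFor(int dim) {
        return (dim + 31) / 32;
    }
}
```

Everything between the two markers is left untouched by the formatter, so the banner comment and the single-line `@Override` layout survive an autoformatting pass.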

mikepapadim · 2026-05-30T10:45:51Z

+    }
+
+    @Override
+    protected String predecessorGraphName(int layerIndex) {


Formatter again: use the formatter on/off annotations, otherwise the first autoformatting pass will flatten this.

mikepapadim · 2026-05-30T10:47:06Z

+
+    // ── Embedding preparation ─────────────────────────────────────────────────
+
+    @Override public EmbeddingPreparer embeddingPreparer() {


Add Javadoc; this is new functionality and no one else knows what it does.

mikepapadim · 2026-05-30T10:48:33Z

+    }
+
+    @Override public ActivationTaskGraph standardActivation() {
+        return new Activation("activationUpdate", state, weights, config);


Maybe the 'activationUpdate' and 'logits' strings should live in an enum or record that is reused, instead of having these Strings all over the place.
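
A minimal sketch of the enum option, using only the two names mentioned in this comment. The enum name and its accessor are hypothetical.

```java
// Illustrative sketch only: enum name and accessor are hypothetical.
enum TaskGraphNameSketch {
    ACTIVATION_UPDATE("activationUpdate"),
    LOGITS("logits");

    private final String id;

    TaskGraphNameSketch(String id) {
        this.id = id;
    }

    // The string handed to task-graph constructors.
    String id() {
        return id;
    }
}
```

The call in the diff would then read `new Activation(TaskGraphNameSketch.ACTIVATION_UPDATE.id(), state, weights, config)`, so a misspelled graph name becomes a compile error rather than a runtime mismatch.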

mikepapadim

LGTM, some minor changes needed.

orionpapadakis added 6 commits May 28, 2026 15:36

[prf/dec]Implement prefill-decode for Llama Q8_0

45204f1

Reorganize TornadoVM execution planning and improve naming conventions

8ebf91f

Reorganize TornadoVM execution planning around forward modes, model families, and quantization-specific components.

Update naming from ActivationGraph to ActivationTaskGraph across …

4e4478a

…TornadoVM components

Rename AbstractFFNLayers to AbstractTransformerLayerTaskGraphs an…

ea478f8

…d `AbstractLogitsLayer` to `AbstractLogitsTaskGraph`, updating all references to improve clarity and align with naming conventions.

Refactor FFN layer comments to transformer-layer task graphs, align…

e20ebc5

…ing with updated naming conventions.

[ci] Add workflows for Llama-3.2-1B-Instruct Q8_0 inference with pref…

c7522d1

…ill-decode and CUDA-graph variants

orionpapadakis requested review from mairooni, mikepapadim and stratika May 28, 2026 12:59

orionpapadakis added the enhancement, refactoring, and prefill-decode labels May 28, 2026

mikepapadim reviewed May 30, 2026

orionpapadakis wants to merge 6 commits into main from refactor/tornadovm-planning
